On Text Coherence Parsing

Author

Udo Hahn

Abstract

In this paper global patterns of thematic text organization are considered within the framework of a distributed model of text understanding. Based on the parsing results of prior text cohesion analysis, specialized text grammar modules determine whether some well-defined text macro organization pattern is computable from the available text representation structures. The model underlying text coherence parsing formalizes hitherto entirely intuitive textlinguistic notions whose origin can be traced back to Daneš's work on thematic progression patterns.

1 INTRODUCTION

During the last years it has become increasingly apparent that dialog and text understanding systems must account for connectivity relations that extend over sentence boundaries. This has led to a bulk of work dealing with various forms of cohesion-preserving language mechanisms, mainly in the field of anaphora, which contribute to connectivity among sentences. From the focus on these linguistic phenomena one might obtain a misleading picture of textual connectivity, viz. one that considers it basically as a 'flat', continuous stream of formally connected utterances lacking additional structure.

Far less research has been devoted to the internal organization of cohesive utterances by mechanisms at a more global level of dialog/text architecture, the level of text coherence. Major computational approaches related to coherence aspects within a dialog processing framework are due to Reichman's [1978], McKeown's [1985] and Scha & Polanyi's [1988] formalizations of dialog grammars. Coherence criteria of written texts have been investigated in the context of 'Rhetorical Structure Theory' [Mann & Thompson 1988] and related extensions [e.g., Alterman 1982, Tucker, Nirenburg & Raskin 1986] of the original theory of coherence relations in discourse [Hobbs 1982].
A second major methodology which deals with the global structuring of written texts is the model of text macro propositions and superstructures [Kintsch & van Dijk 1978, van Dijk 1980], the latter sharing all relevant properties one generally attributes to story grammars [Rumelhart 1975]. The problem with this kind of methodology is that, unlike the coherence relation approach, the grammars which have been proposed so far are fairly idiosyncratic for each application domain (narratives, weather reports, etc.).

Common to all these approaches is the requirement of a deep, propositionally guided understanding of the underlying discourse; in particular, a complete theory of its domain and an exhaustive specification of a natural language grammar must be supplied in order to guarantee proper operation of implemented systems. This might explain why, with only few exceptions, these models of text coherence have resisted further computational treatment as evidenced by operational systems.

Actes de COLING-92, Nantes, 23-28 août 1992

We here make an alternative and computationally more tractable proposal on how to deal with global text structures at the text coherence level. Its roots can be traced back to the seminal work of F. Daneš [1974], in which he informally developed the notion of thematic progression patterns, distinguishing between three prototypical patterns, viz. constant theme, continuous thematization of rhemes, and derived theme (see section 3). The model outlined in this paper starts from a thorough formalization of (one of) these notions and places it into the environment of a fully operational text parsing system whose design is mainly oriented towards the proper recognition of text cohesion and coherence phenomena. Pertinent reasons for our choice of a Daneš-type model of text coherence are:

(1) The text parser forms part of the text understanding system TOPIC.
It operates in a real-world domain [Reimer & Hahn 1988], i.e. textual input is taken from a permanent stream of test reports in major German information technology magazines. As it seems that it will remain infeasible for a long time to come to provide exhaustive domain and grammar specifications for routinely operating text understanders, a particularly robust partial parsing approach capable of handling potential specification gaps has been adopted. These conditions obviously preclude the consideration of RST-style coherence relation computing as a text coherence analysis strategy, since relevant knowledge portions might be lacking for determining specific instances of coherence relations. Conversely, the coherence relation approach seems currently infeasible for the routine processing of large-scale text collections in real domains.

(2) The description of coherence structures in terms of coherence relations or text macro propositions requires the availability of deep assertional knowledge from the application domain (A-box level specifications in Krypton terminology; cf. Brachman et al. [1985]). The TOPIC system, however, emphasizes the role of terminological knowledge of its domain, i.e. the description of prototypical properties and inference rules related to basic conceptual units of the domain (Krypton's T-box level knowledge). As TOPIC is rather weak with respect to full-blown assertional knowledge, coherence relation computing, however valuable it might be, is currently out of reach for this system. Fortunately, Daneš-type coherence patterns primarily refer to the level of terminological knowledge.

(3) Prototypical patterns of thematic progression are fairly general and independent of particular domains that expository texts deal with. Linguistic studies have collected empirical evidence for this claim through investigations of texts from diverse domains [Giora 1983a, Kurzon 1984]. This coincides with the generality of use of most coherence relations, but is in sharp contrast to the highly constrained and domain-dependent model of superstructures and story grammars.

(4) Major thematic progression patterns are correlated with particular search styles and retrieval modes in full-text information systems. Hence, providing typed coherence operators inherently supports graphics-based user interactions with the TOPIC system in terms of advanced conceptual orientation and navigation tools for semantically guided text graph tours (see section 5.3).

(5) The investigation of thematic progression patterns is of value in its own methodological right. They constitute a basic structural model of text macro organization as opposed to model-theoretic and plan/goal-based approaches (a distinction made by Pustejovsky [1987]). As such they might complement current text understanding methodologies whose emphasis, so far, has been on fairly knowledge-expensive assertional models (such as coherence relations and text macro propositions) or stereotyped text-semantical models (such as superstructures and story grammars).

2 MOTIVATING THE NEED FOR TEXT COHERENCE PARSING

The model of text structure parsing we propose draws a careful distinction between text cohesion and text coherence phenomena. As to the illustration of text cohesion mechanisms in natural language texts, consider the following text passage:

[1] The Delta-X from ZetaMachines Inc. is a computer system that runs Unix V.3.
[2] The system is based on a 68020 processor.
[3] It has a 12-inch monochrome display and an integrated telephone handset and built-in modem.
[4] Internally, there's a 40-megabyte hard disk, a 1.2-megabyte 5 1/4-inch floppy disk drive, 4.5 megabytes of RAM, three RS-232C ports, and an ST-506 port.
Repeated occurrences of various text cohesion phenomena are illustrated by nominal anaphora ('The system' in [2]) and pronominal anaphora ('It' in [3]), both referring to the unique antecedent Delta-X (in [1]), while 'Internally, there's a ... hard disk' (in [4]) is linked to Delta-X via textual ellipsis. The basic cohesion among these sentences yields the common thematic background for constantly elaborating on a single topic (Delta-X). An appropriate text parser should, first of all, recognize these multiple cohesion phenomena and produce something like the following representation structures (indicated by [...]R):

[1]R Delta-X < manufacturer: { ZetaMachines Inc. } >
     Delta-X < operating system: { Unix V.3 } >
[2]R Delta-X < CPU: { 68020 } >
[3]R Delta-X < peripheral devices: { 12-inch monochrome display } >
     Delta-X < peripheral devices: { telephone handset } >
     Delta-X < communication devices: { modem } >
[4]R Delta-X < external storage devices: { 40-megabyte hard disk } >
     Delta-X < external storage devices: { 1.2-megabyte 5 1/4-inch floppy disk drive } >
     Delta-X < main memory: { 4.5 megabytes of RAM } >
     Delta-X < ports: { 3 RS-232C ports } >
     Delta-X < ports: { ST-506 port } >

What is still lacking is a representation facility which characterizes this sequence of single assertions constantly referring to a single topic (Delta-X) as constituting a coherent whole. Recognizing linguistic forms of text coherency and providing appropriate thematic grouping operators for text knowledge bases is what text coherence parsing mainly is about. Even if parsers perfectly recognized and normalized all occurrences of text cohesion phenomena in texts, missing recognition capabilities for text coherence phenomena would nevertheless produce under-structured, incoherent text knowledge bases in the sense that global pragmatic indicators of discourse bracketing would be lacking.
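The representation structures above can be read as frame/slot/filler triples. The following minimal Python sketch is not part of TOPIC; the names `Assertion` and `group_by_frame` are hypothetical and chosen only for illustration. It stores the assertions derived from sentences [1] to [4] and groups them by frame, which makes visible that every assertion elaborates on the same topic.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    """One normalized assertion: frame < slot: { filler } > (hypothetical type)."""
    sentence: int  # number of the source sentence, e.g. 1 for [1]
    frame: str
    slot: str
    filler: str


# Assertions copied from the representation structures [1]R to [4]R.
ASSERTIONS = [
    Assertion(1, "Delta-X", "manufacturer", "ZetaMachines Inc."),
    Assertion(1, "Delta-X", "operating system", "Unix V.3"),
    Assertion(2, "Delta-X", "CPU", "68020"),
    Assertion(3, "Delta-X", "peripheral devices", "12-inch monochrome display"),
    Assertion(3, "Delta-X", "peripheral devices", "telephone handset"),
    Assertion(3, "Delta-X", "communication devices", "modem"),
    Assertion(4, "Delta-X", "external storage devices", "40-megabyte hard disk"),
    Assertion(4, "Delta-X", "external storage devices",
              "1.2-megabyte 5 1/4-inch floppy disk drive"),
    Assertion(4, "Delta-X", "main memory", "4.5 megabytes of RAM"),
    Assertion(4, "Delta-X", "ports", "3 RS-232C ports"),
    Assertion(4, "Delta-X", "ports", "ST-506 port"),
]


def group_by_frame(assertions):
    """Collect fillers per frame and slot, keeping the order of mention."""
    grouped = defaultdict(lambda: defaultdict(list))
    for a in assertions:
        grouped[a.frame][a.slot].append(a.filler)
    return {frame: dict(slots) for frame, slots in grouped.items()}


grouped = group_by_frame(ASSERTIONS)
```

Here `grouped` contains a single key, `"Delta-X"`. Grouping alone only shows that the assertions share a frame; it does not yet mark the passage as one coherent unit.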
3 BASIC TEXT COHERENCE PATTERNS

In this section, we informally describe the basic patterns of text coherence focused on in this paper. According to Daneš [1974] three categories of thematic developments can be distinguished:

Constant Theme. This pattern is characterized by the constant elaboration of one specific topic within a text (passage) by considering several of its conceptual facets. The following paragraph serves to illustrate this major pattern of thematic progression (the reference points to the constant theme (Delta-X) are indicated by italics):

[T1.1] The Delta-X from ZetaMachines Inc. is a multiuser, multitasking computer system that runs Unix V.3 and comes complete with most of the software needed for business applications. The combination host computer/workstation is based on a 68020 processor, with dual 68000 processors providing peripheral processing. It has a 12-inch monochrome display and an integrated telephone handset and built-in modem. Internally, there's a 40-megabyte hard disk, a 1.2-megabyte 5 1/4-inch floppy disk drive, 4.5 megabytes of RAM, a network controller, three RS-232C ports, and an ST-506 port.

Continuous Thematization of Rhemes. In contrast to constant themes, this pattern realizes a continuous shift of topics (visualized by bold italics). The process starts with a theme and some comment on that theme which we shall call rheme (actually, an elaboration on one of its conceptual facets). Now this rheme is focused on as the next theme that is elaborated by a corresponding rheme, etc.:

[T1.2] The $12,000 Delta-X host/workstation can be supplied from ZetaMachines Inc., 2999 State St., Santa Barbara, CA 93105. ZetaMachines' sales manager, Brian Wilson, says that they also plan to market the Gamma-Z, a CAD/CAM workstation based on a Connection Machine architecture. The underlying theoretical foundations are due to D. Hillis, a former M.I.T.
student who first developed an experimental prototype based on connectionist principles.

Derived Theme. Global text structure can also be introduced by a variety of topics which share conceptual commonalities (facets) at the knowledge representation level (this need not be paralleled by properties actually mentioned in the text!) without the general concept being explicitly stated in the text. Technically this is realized by a set of subordinates or instances of a common (only implicit) superordinate/prototype. Suppose the illustrative text [T1], composed of its two constituent parts from above, [T1.1] and [T1.2], is augmented by several paragraphs dealing with Gamma-Z and Sigma-P machines on a similar level of detail as those passages which consider the Delta-X in [T1]:

[T2] The Delta-X from ZetaMachines ... [T1.1-T1.2] The Gamma-Z is an MS-DOS machine. Peripheral devices include an 8-inch color display, a matrix printer, and a keyboard. ... The Sigma-P system makes available a lot of desirable application software such as a database system, word processing, and a variety of games. ...

This text implicitly has workstation as a derived theme, since that is the immediate prototype concept of those three instances (Delta-X, Gamma-Z, Sigma-P) explicitly mentioned in [T2].

4 THE KNOWLEDGE SOURCES INVOLVED IN TEXT PARSING

This section deals with the knowledge sources involved in actually parsing a text. Basically (see Figure 1), these are constituted by the PARSE BULLETIN, a blackboard-type memory which records the single events of the parsing process, the DOMAIN KNOWLEDGE BASE, which contains the domain-specific background knowledge needed for the parse, and various EXPERTs for actually driving the parse through the text grammar specifications they incorporate (cf.
Hahn [1990] for a more comprehensive presentation).

The PARSE BULLETIN has a flat list structure. It records the sequence of text tokens as they appear in the text and, if relevant (see below), notes their class identifiers (FRAME item, ADJective, etc.). More important, constructive parsing activities based on operations of the knowledge base and the parser are indicated at several positions (so-called parse points) in the PARSE BULLETIN. The type of operation being performed is indicated by a particular parse descriptor. Some are internal to the management of the knowledge base, e.g., DEFACT (default concept activation), while others indicate grammatical relations recognized by the parser, such as NounATT (conceptual attribution relations between nouns) and AdjATT (conceptual attribution relations between adjectives and nouns). The items affected by an operation form a so-called parse tuple.

The parser does not consider every token it receives from the input text at the same level of detail. Instead, it distinguishes between words which are significant to its performance (conceptually relevant ones, such as nouns or adjectives which denote concepts in the domain knowledge base, or linguistically relevant ones, such as negation particles, certain conjunctions, quantifiers, etc.), and those that are not (among them a wide variety of semantically indifferent nouns, verbs, particles, etc., each of which is assigned the class identifier NIL). The latter are simply discarded from further analysis, while the former are assigned lexicalized grammar specifications. The parser has thus been tuned towards partial parsing in a spirit similar to that advocated by Schank et al. [1980] and achieves text understanding primarily on a terminological level of knowledge representation.

[Figure 1: A Snapshot of the Parser (also Pre-Conditions Holding with Respect to a Constant Theme Pattern). The figure shows the PARSE BULLETIN entries for the example paragraph and the corresponding Delta-X frame in the DOMAIN KNOWLEDGE BASE.]

The DOMAIN KNOWLEDGE BASE (KB for short) contains frame representation structures. Each frame identifier (in bold face) is assigned a list of slots (enclosed by angular brackets). These slots are associated with two different kinds of slot fillers. Permitted slot fillers are enclosed in square brackets, [a-frame name], which characterizes the range of possible slot fillers by all those frames which are a subordinate or an instance of frame name.
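The range restriction on permitted slot fillers amounts to a reachability test in the frame hierarchy. The sketch below is a minimal illustration under stated assumptions, not TOPIC's implementation: the `PARENT` table, the concept names `Processor` and `Computer System`, and the functions `is_a` and `is_permitted` are all hypothetical. Only the relation between Delta-X and workstation is taken from the text.

```python
# Hypothetical fragment of a frame hierarchy: child -> immediate
# superordinate (or prototype, for instances). None marks a root.
PARENT = {
    "Computer System": None,
    "Workstation": "Computer System",
    "Delta-X": "Workstation",
    "Processor": None,
    "68020": "Processor",
    "68000": "Processor",
}


def is_a(frame, ancestor):
    """True if `frame` is a direct or indirect subordinate/instance of `ancestor`."""
    current = PARENT.get(frame)
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def is_permitted(filler, range_frame):
    """Check a candidate filler against a range restriction [a-range_frame]."""
    return is_a(filler, range_frame)
```

With this table, `is_permitted("68020", "Processor")` holds, while `is_permitted("Delta-X", "Processor")` does not. The test is strict: a frame does not count as its own subordinate, which follows the wording "a subordinate or an instance of frame name".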
Actual slot fillers are enclosed in curly braces and can be taken as facts either known a priori to the system or acquired continuously from the text as its understanding proceeds during the parse.

In addition, each concept has attached to it an activation weight counter. The values of the weight factors are enclosed by vertical bars attached to each item; if no bars explicitly occur, a zero weight is assumed. Activation weights are incremented (starting from zero-level activation) whenever a noun denoting its associated concept occurs in the text, and whenever structure-building operations in KB affect that concept. The manipulation of activation weights serves several purposes, the major one being their use as an indicator of salience of concepts during the text condensation phase, during which text summaries are generated from the text representation structures resulting from the text parse [Reimer & Hahn 1988].

The text grammar is composed of a set of distributed grammar experts, each one responsible for some specific linguistic function (e.g., concept attribution via nominal, adjectival or prepositional phrases, anaphora). Each expert is characterized by a unique EXPERT NAME and is activated by a message event, i.e., by receiving a message text which may contain some parameters. In order to check its competence in contributing to the parse, pre-conditions composed of complex test predicates are evaluated. If these pre-conditions hold for that expert, the post-conditions immediately apply, i.e. messages are sent to qualified actors (to other grammar experts, to the domain KB or to the bulletin).

5 A DISTRIBUTED MODEL OF TEXT COHERENCE PARSING

In this paper, we shall not go into the details of phrasal, clausal, and text cohesion parsing (cf.
Hahn [1989] for an in-depth consideration of related technical issues). Instead, we assume that these preliminary activities have already been carried out properly and that some initial structural representation is already available from the bulletin. These requirements are fulfilled in the snapshot of the PARSE BULLETIN in Figure 1, taken after all local parsing events have terminated; this characterizes a state ready to turn to the activation of global text structure computing experts.

We here consider the end of the paragraph (denoted by the symbol 0 and the class identifier EOP) as an anchoring point for coherence computation. It is motivated by the observation that, at least in the sublanguage domain we are currently working in, major topic movements occur predominantly at paragraph boundaries. This coincides with linguistic evidence for the (text)grammatical status of paragraphs [Hinds 1979, Giora 1983b, and Zadrozny & Jensen 1991]. Therefore, the proper recognition of textual macro structures is always initialized at the end of a paragraph.

5.1 Considering Constant Theme

Constant theme is a coherence pattern which is characterized by multiple occurrences of a single frame in the PARSE BULLETIN within one paragraph. Most of its occurrences, in turn, are accompanied by a slot and/or slot filler indicating that some knowledge base operation with respect to frame has been carried out in KB (e.g., slot filling as indicated by NounATT or AdjATT, for which we shall introduce the LC* descriptor as a convenient shorthand notation). It is the continuous elaboration of that particular concept that makes the corresponding text passage coherent. While the bulletin maintains the sequential order of these operations, KB provides the conceptual background for continuous references to the same frame object.
Figure 2 visualizes the description for constant theme; the DOMAIN KNOWLEDGE BASE window displays all properties of frame dealt with in a text (passage) in the shadowed area of the frame box, while those not mentioned in the text are in the remaining white part. Consequently, it is neither necessary that all slots of a frame available in the knowledge base be referred to in the text (as with slot_{n+1}, ..., slot_m), nor that there be any ordering constraint relating single slots of a frame in KB to the sequence of slot filling operations in the PARSE BULLETIN.
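The informal description of constant theme can be sketched as a test over the bulletin entries of one paragraph. The following Python code is only an illustration under assumptions: the tuple layout `(descriptor, frame, slot)`, the function name `constant_theme`, the threshold `min_ops`, and the rule of picking the frame with the most slot-filling operations are not taken from the paper, whose exact pre-conditions are not given in this section. The sample entries are loosely modelled on the Delta-X paragraph.

```python
from collections import Counter

# Slot-filling descriptors; the text abbreviates them as LC*.
SLOT_FILLING = {"NounATT", "AdjATT"}


def constant_theme(bulletin, min_ops=2):
    """Return (frame, slots) if one frame is elaborated repeatedly, else None.

    `bulletin` is a list of (descriptor, frame, slot) tuples for a single
    paragraph. `min_ops` is an assumed threshold, not a value from the paper.
    """
    # Coherence computation is anchored at the end of a paragraph (EOP).
    if not bulletin or bulletin[-1][0] != "EOP":
        return None
    ops = [(frame, slot) for desc, frame, slot in bulletin if desc in SLOT_FILLING]
    counts = Counter(frame for frame, _ in ops)
    if not counts:
        return None
    frame, n = counts.most_common(1)[0]
    if n < min_ops:
        return None
    # No ordering constraint is imposed on the slots; they are simply
    # reported in the order in which the bulletin recorded them.
    return frame, [slot for f, slot in ops if f == frame]


# Illustrative entries, loosely modelled on the Delta-X paragraph.
PARAGRAPH = [
    ("DEFACT", "Delta-X", None),
    ("NounATT", "Delta-X", "manufacturer"),
    ("NounATT", "Delta-X", "operating system"),
    ("NounATT", "Delta-X", "CPU"),
    ("AdjATT", "display-1", "presentation mode"),
    ("NounATT", "Delta-X", "peripheral devices"),
    ("EOP", None, None),
]

result = constant_theme(PARAGRAPH)
```

For the sample paragraph the function reports Delta-X together with the four slots filled for it. Slots of the frame that the text never mentions play no role in the test, in line with the remark that not all slots need to be referred to.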

Publication date: 1992
